library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(usmap)
library(maps)
library(ggplot2)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 2.1.3 v purrr 0.3.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::map() masks maps::map()
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(viridis)
## Loading required package: viridisLite
library(rgdal)
## Loading required package: sp
## rgdal: version: 1.4-8, (SVN revision 845)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
## Path to GDAL shared files: C:/Users/16196/OneDrive/Documents/R/win-library/3.6/rgdal/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
## Path to PROJ.4 shared files: C:/Users/16196/OneDrive/Documents/R/win-library/3.6/rgdal/proj
## Linking to sp version: 1.4-1
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggmap':
##
## wind
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
require(janitor)
## Loading required package: janitor
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
require(reshape2)
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
require(tidyr)
require(rstan)
## Loading required package: rstan
## Loading required package: StanHeaders
## rstan (Version 2.19.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For improved execution time, we recommend calling
## Sys.setenv(LOCAL_CPPFLAGS = '-march=native')
## although this causes Stan to throw an error on a few processors.
##
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
##
## extract
require(rstanarm)
## Loading required package: rstanarm
## Loading required package: Rcpp
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## rstanarm (Version 2.19.2, packaged: 2019-10-01 20:20:33 UTC)
## - Do not expect the default priors to remain the same in future rstanarm versions.
## Thus, R scripts should specify priors explicitly, even if they are just the defaults.
## - For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores())
## - bayesplot theme set to bayesplot::theme_default()
## * Does _not_ affect other ggplot2 plots
## * See ?bayesplot_theme_set for details on theme setting
##
## Attaching package: 'rstanarm'
## The following object is masked from 'package:rstan':
##
## loo
require(bayesplot)
## Loading required package: bayesplot
## This is bayesplot version 1.7.1
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
## * Does _not_ affect other ggplot2 plots
## * See ?bayesplot_theme_set for details on theme setting
Finaldata<- read.csv("time_series_final_Data.csv")
google<-read.csv("Google/googlefinal.csv")
names(google)
## [1] "X.1" "X" "Day"
## [4] "State" "ChinaVirusInterest" "KungFluInterest"
## [7] "Region"
google <- google %>% select(-c(X.1))
Finaldata <- Finaldata %>% select(-c(X))
Finaldata <- Finaldata %>% mutate(Day= as.Date(Day))
Demographic: Our dataset, Demographic, was created by merging three other datasets that contained different demographic and election information. We obtained the main portion of our demographic data from the American Community Survey (ACS), an ongoing survey administered by the U.S. Census Bureau that annually gathers information on income, employment, housing characteristics, etc., for all 50 U.S. states at the county and state level. To access the county-level dataset we used the R package choroplethr, which provides API connections to data sources like the ACS. The ACS county-level dataset was then merged with a county-level election outcome dataset created by Tony McGovern. His dataset contained presidential election results for 2008, 2012, and 2016, but we chose to focus solely on the most recent election, 2016. The 2016 county-level election results were scraped from results published by Townhall.com. However, the State of Alaska reports results at the precinct or state level, so no county-level data was available; another dataset therefore had to be created containing the election results for Alaska, built from the official results provided by the Alaska Division of Elections, and was later merged in. The final dataset we used came from Alicia Johnson and contained information on each state's political leaning: it categorizes each county as belonging to a blue/red/purple state based on the state categorizations at 270toWin.
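The merge logic for the election portion can be sketched with a small toy example. The data-frame and column names below mirror `Finaldata`, and the Alaska vote counts come from the data shown later in this report, but the two input frames are hypothetical stand-ins for the McGovern and Johnson files, not the actual sources:

```r
library(dplyr)

# Toy stand-ins for the two county/state-level inputs (Alaska values as
# reported in Finaldata below)
election <- data.frame(
  polyname = "alaska",
  total_2016 = 318608, dem_2016 = 116454, gop_2016 = 163387, oth_2016 = 38767
)
leaning <- data.frame(polyname = "alaska", StateColor = "red")

# Merge on the shared state identifier, then derive the vote-share variables
demo <- election %>%
  left_join(leaning, by = "polyname") %>%
  mutate(
    percent_democrat2016   = dem_2016 / total_2016,
    percent_republican2016 = gop_2016 / total_2016,
    percent_other2016      = oth_2016 / total_2016,
    Winner = ifelse(gop_2016 > dem_2016, "Republican", "Democrat")
  )

demo$percent_democrat2016  # ~0.3655, matching Finaldata's Alaska rows
```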
COVID-19 Cases: The COVID-19 data is provided by The COVID Tracking Project (CTP). All of the data points come from state/district/territory public health authorities or, occasionally, from trusted news reporting, official press conferences, or (very occasionally) tweets or Facebook updates from state public health authorities or governors. These numbers are updated daily at 4 PM EST. The biggest weakness of this dataset is that there is no standardized method for states to follow for data collection and reporting. For example, some states, like Oregon, provide the full set of numbers, while others provide some or none of these numbers on an ongoing basis. Some crucial states in this outbreak, notably California, Washington, and New York, have not been regularly reporting their total number of people tested. The CTP aims to remedy this uncertainty by utilizing other reporting/measuring tools, such as: "Directly asking state officials, watching news conferences, gleaning information from trusted news sources, and whatever else it takes to present reliable numbers."
Google Search Interest: This dataset includes two search interest indexes over time, measuring interest within each state in searching for either "Kung Flu" or "China Virus" during the selected time frame. The data was downloaded directly from Google Trends, which uses the same technique to track interest in all searches on the platform. The main downside of this dataset is the indexing method, which makes state-to-state comparison less meaningful: each state is guaranteed a 100-level interest on its peak day, while the actual (unknown) search volumes can vary greatly across states.
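The within-state indexing described above can be illustrated with a short sketch (the raw daily search counts below are hypothetical; Google does not publish them). Each day's value is scaled so the state's peak day equals 100, so two states with very different search volumes can produce identical-looking indexes:

```r
# Hypothetical daily raw search counts for two states
state_a <- c(20, 50, 200, 120)  # high-volume state
state_b <- c(2, 5, 20, 12)      # low-volume state with the same daily shape

# Google Trends-style index: scale each state so its peak day = 100
trends_index <- function(x) round(100 * x / max(x))

trends_index(state_a)  # 10 25 100 60
trends_index(state_b)  # identical index despite 10x fewer searches
```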
dim(Finaldata)
## [1] 765 31
names(Finaldata)
## [1] "Day" "State"
## [3] "ChinaVirusInterest" "KungFluInterest"
## [5] "positive" "negative"
## [7] "death" "hospitalized"
## [9] "totalTestResults" "FIPS"
## [11] "StayAtHome_date" "Quarantine_Yes"
## [13] "polyname" "StateColor"
## [15] "total_2016" "dem_2016"
## [17] "gop_2016" "oth_2016"
## [19] "total_population" "percent_white"
## [21] "percent_black" "percent_asian"
## [23] "percent_hispanic" "per_capita_income"
## [25] "median_rent" "median_age"
## [27] "percent_democrat2016" "percent_republican2016"
## [29] "percent_other2016" "Winner"
## [31] "Region"
head(Finaldata)
## Day State ChinaVirusInterest KungFluInterest positive negative
## 1 2020-03-10 AK 27 0 0 23
## 2 2020-03-11 AK 0 0 0 46
## 3 2020-03-12 AK 26 0 0 46
## 4 2020-03-13 AK 28 0 1 59
## 5 2020-03-14 AK 30 0 1 143
## 6 2020-03-15 AK 61 0 1 143
## death hospitalized totalTestResults FIPS StayAtHome_date Quarantine_Yes
## 1 NA NA 23 2 03/28/2020 1
## 2 NA NA 46 2 03/28/2020 1
## 3 NA NA 46 2 03/28/2020 1
## 4 NA NA 60 2 03/28/2020 1
## 5 NA NA 144 2 03/28/2020 1
## 6 NA NA 144 2 03/28/2020 1
## polyname StateColor total_2016 dem_2016 gop_2016 oth_2016
## 1 alaska red 318608 116454 163387 38767
## 2 alaska red 318608 116454 163387 38767
## 3 alaska red 318608 116454 163387 38767
## 4 alaska red 318608 116454 163387 38767
## 5 alaska red 318608 116454 163387 38767
## 6 alaska red 318608 116454 163387 38767
## total_population percent_white percent_black percent_asian
## 1 741456 0.65 0.04 0.07
## 2 741456 0.65 0.04 0.07
## 3 741456 0.65 0.04 0.07
## 4 741456 0.65 0.04 0.07
## 5 741456 0.65 0.04 0.07
## 6 741456 0.65 0.04 0.07
## percent_hispanic per_capita_income median_rent median_age
## 1 0.07 34922 NA 34.7
## 2 0.07 34922 NA 34.7
## 3 0.07 34922 NA 34.7
## 4 0.07 34922 NA 34.7
## 5 0.07 34922 NA 34.7
## 6 0.07 34922 NA 34.7
## percent_democrat2016 percent_republican2016 percent_other2016 Winner
## 1 0.3655087 0.5128151 0.1216762 Republican
## 2 0.3655087 0.5128151 0.1216762 Republican
## 3 0.3655087 0.5128151 0.1216762 Republican
## 4 0.3655087 0.5128151 0.1216762 Republican
## 5 0.3655087 0.5128151 0.1216762 Republican
## 6 0.3655087 0.5128151 0.1216762 Republican
## Region
## 1 West
## 2 West
## 3 West
## 4 West
## 5 West
## 6 West
summary(Finaldata)
## Day State ChinaVirusInterest KungFluInterest
## Min. :2020-03-10 AK : 15 Min. : 0.00 Min. : 0.000
## 1st Qu.:2020-03-13 AL : 15 1st Qu.: 26.00 1st Qu.: 0.000
## Median :2020-03-17 AR : 15 Median : 37.00 Median : 0.000
## Mean :2020-03-17 AZ : 15 Mean : 37.24 Mean : 3.903
## 3rd Qu.:2020-03-21 CA : 15 3rd Qu.: 49.00 3rd Qu.: 5.000
## Max. :2020-03-24 CO : 15 Max. :100.00 Max. :58.000
## (Other):675
## positive negative death hospitalized
## Min. : 0.0 Min. : 0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 10.0 1st Qu.: 94 1st Qu.: 1.000 1st Qu.: 3.75
## Median : 38.0 Median : 337 Median : 2.000 Median : 32.50
## Mean : 322.2 Mean : 1820 Mean : 8.728 Mean : 210.91
## 3rd Qu.: 152.0 3rd Qu.: 1447 3rd Qu.: 5.000 3rd Qu.: 73.75
## Max. :25665.0 Max. :65605 Max. :210.000 Max. :3234.00
## NA's :41 NA's :416 NA's :709
## totalTestResults FIPS StayAtHome_date Quarantine_Yes
## Min. : 0 Min. : 1.00 03/24/2020: 90 Min. :0.00
## 1st Qu.: 103 1st Qu.:16.00 03/28/2020: 75 1st Qu.:1.00
## Median : 380 Median :29.00 03/30/2020: 75 Median :1.00
## Mean : 2045 Mean :28.96 03/23/2020: 60 Mean :0.88
## 3rd Qu.: 1500 3rd Qu.:42.00 03/25/2020: 60 3rd Qu.:1.00
## Max. :91270 Max. :56.00 (Other) :300 Max. :1.00
## NA's :105 NA's :15
## polyname StateColor total_2016 dem_2016
## alabama : 15 blue :285 Min. : 248742 Min. : 55949
## alaska : 15 purple:165 1st Qu.: 730628 1st Qu.: 266827
## arizona : 15 red :315 Median :1922218 Median : 779535
## arkansas : 15 Mean :2489256 Mean :1188393
## california: 15 3rd Qu.:3208899 3rd Qu.:1534487
## colorado : 15 Max. :9631972 Max. :5931283
## (Other) :675
## gop_2016 oth_2016 total_population percent_white
## Min. : 11553 Min. : 8496 Min. : 570134 Min. :0.2300
## 1st Qu.: 345598 1st Qu.: 38767 1st Qu.: 1583364 1st Qu.:0.5900
## Median : 947934 Median : 91364 Median : 4361333 Median :0.7400
## Mean :1179254 Mean :121608 Mean : 6108975 Mean :0.7029
## 3rd Qu.:1535513 3rd Qu.:183694 3rd Qu.: 6819579 3rd Qu.:0.8300
## Max. :4681590 Max. :515968 Max. :37659181 Max. :0.9400
##
## percent_black percent_asian percent_hispanic per_capita_income
## Min. :0.0000 Min. :0.01000 Min. :0.0100 Min. :20618
## 1st Qu.:0.0300 1st Qu.:0.01000 1st Qu.:0.0400 1st Qu.:24635
## Median :0.0700 Median :0.02000 Median :0.0800 Median :26824
## Mean :0.1084 Mean :0.03765 Mean :0.1082 Mean :28098
## 3rd Qu.:0.1500 3rd Qu.:0.04000 3rd Qu.:0.1300 3rd Qu.:30469
## Max. :0.4900 Max. :0.37000 Max. :0.4700 Max. :45290
##
## median_rent median_age percent_democrat2016
## Min. : 448.0 Min. :29.60 Min. :0.2249
## 1st Qu.: 564.0 1st Qu.:36.30 1st Qu.:0.3609
## Median : 658.0 Median :37.60 Median :0.4670
## Mean : 714.3 Mean :37.66 Mean :0.4501
## 3rd Qu.: 838.0 3rd Qu.:39.00 3rd Qu.:0.5335
## Max. :1220.0 Max. :43.20 Max. :0.9285
## NA's :15
## percent_republican2016 percent_other2016 Winner Region
## Min. :0.04122 Min. :0.01937 Democrat :315 Midwest :195
## 1st Qu.:0.41161 1st Qu.:0.04160 Republican:450 Mountain :120
## Median :0.49064 Median :0.05071 Northeast:180
## Mean :0.49010 Mean :0.05976 South :195
## 3rd Qu.:0.58095 3rd Qu.:0.06991 West : 75
## Max. :0.70052 Max. :0.25598
##
| Variables: | Description: |
|---|---|
| `State` | State name |
| `StateColor` | Political leaning (blue/red/purple) |
| `percent_hispanic` | Percent of the population that is Hispanic |
| `percent_white` | Percent of the population that is White |
| `percent_asian` | Percent of the population that is Asian |
| `percent_black` | Percent of the population that is Black |
| `total_population` | Total state population |
| `per_capita_income` | Income per capita |
| `percent_democrat2016` | Percent of votes won by the Democrat (Clinton) |
| `percent_republican2016` | Percent of votes won by the Republican (Trump) |
| `Winner` | Indicator for whether a Republican or Democrat won |
| `total_2016` | Total number of votes |
| `positive` | Number of reported positive COVID-19 cases |
| `negative` | Number of reported negative COVID-19 cases |
| `Day` | Date of report |
| `death` | Total number of reported deaths due to COVID-19 |
| `hospitalized` | Total number of individuals hospitalized due to COVID-19 |
| `totalTestResults` | Total number of test results (positive + negative) |
| `FIPS` | A five-digit Federal Information Processing Standards code that uniquely identifies counties and county equivalents |
| `ChinaVirusInterest` | Interest index from Google searches by state; the peak search day = 100 and all other days are scaled relative to this peak |
| `KungFluInterest` | Interest index from Google searches by state; the peak search day = 100 and all other days are scaled relative to this peak |
| `Region` | States divided into five regions: West, South, Mountain, Northeast, Midwest |
| `StayAtHome_date` | The date the state enforced quarantine |
| `Quarantine_Yes` | A 0/1 indicator of whether the state has enforced quarantine |
Trump<- as.Date("03/16/2020", format = "%m/%d/%Y")
day1<-as.Date("03/10/2020",format = "%m/%d/%Y")
day2<-as.Date("03/24/2020",format = "%m/%d/%Y")
google_ts<-google%>%
mutate(Day = as.Date(as.character(Day)))
a<-google_ts %>%
#filter(Day<=day2)%>%
#filter(Day>=day1)%>%
group_by(Region, Day) %>%
summarize(ChinaVirusSearch = median(ChinaVirusInterest)) %>%
ggplot(aes(x=Day, y=ChinaVirusSearch, color=Region))+
geom_point()
ggplotly(a)
This plot shows "China Virus" search interest over time, grouped by region. It shows that there are certainly key events that trigger an uptick in searches overall. It is not clear from this plot which region searched "China Virus" more or less often, but it does show that the regions move together in search interest, which suggests that federal-level events, such as a Donald Trump tweet, trigger these interest spikes.
#Demographic: Identification
## Might not use, how helpful is this? Possibly combine into a shiny app so people could adjust for which race to look at?
Finaldata <- data.frame(Finaldata) %>% mutate(state = State)
plot_usmap(data = Finaldata, values = "percent_white", color = "white") +
scale_fill_continuous(name = "Percent White", label = scales::comma) +
theme(legend.position = "right")+ ggtitle("Percent of Residents that Identify as White") +
theme(
plot.title = element_text(color="Black", size=14, face="bold")
)
Our first visualization looks at the percent of residents who identify as white within the United States. As you can see, there is a higher percent of white-identifying residents in the Midwest and Northeast states. From this visualization we can also see that places like Texas, California, and New Mexico have much lower percentages of white-identifying residents, which could provide important information for our actual analysis.
plot_usmap(data = Finaldata, values = "percent_asian", color = "white") +
scale_fill_continuous(low = "sky blue", high = "black", name = "Percent Asian", label = scales::comma) +
theme(legend.position = "right") + ggtitle("Percent of Residents that Identify as Asian") +
theme(
plot.title = element_text(color="Black", size=14, face="bold")
)
The above visualization looks at the percent of residents who identify as Asian within the United States. As you can see, there is a higher percent of residents who identify as Asian in places like California and Washington, with the highest being in Hawaii. From this visualization we can also see that the Midwest and the South tend to have a small percentage of residents identifying as Asian. We are especially interested in the percent_asian variable, as it plays a major role in our analysis.
#By income
plot_usmap(data = Finaldata, values = "per_capita_income", color = "white") +
scale_fill_continuous(low = "sky blue", high = "black", name = "$ Per Capita", label = scales::comma) +
theme(legend.position = "right") + ggtitle("Income per Capita")+
theme(
plot.title = element_text(color="Black", size=14, face="bold")
)
We also wanted to take a look at the per capita income variable, as we believe it could be valuable within our analysis. Specifically, we think it could be an indicator for the number of COVID cases and/or could affect an individual's reaction to Trump's tweet. This visualization, however, is just univariate, as we wanted to understand the variable a bit more. As you can see, the Northeast and the West Coast tend to have higher incomes per capita than places like the South.
#Income by state affiliation
g <- ggplot(Finaldata, aes(x = per_capita_income, fill = StateColor)) + geom_density(alpha = .8)+ ggtitle("Income per Capita") + xlab("$") + ylab("Density")
g + scale_fill_manual(values=c("blue", "purple", "red"))
After learning more about the per capita income variable, we wanted to see how it would change when we added State Color into the mix, where State Color signifies a state's political leaning during the election. Interestingly, states that were red had much lower incomes per capita than states that were blue. We see that the blue states had a much wider and higher spread of income per capita, ranging all the way to $45,000. Purple states, on the other hand, tended to be in the middle.
#Trump Support
ggplot(Finaldata, aes(x = percent_republican2016)) +
geom_density()+ ggtitle("Trump Support During the 2016 Elections") + xlab("Percent of Trump Support") + ylab("Density") + facet_wrap(~Region)
This visualization looks at Trump support during the 2016 election across different U.S. regions. It seems as though each region had fairly varied support. The Northeast has an interesting peak at 0.4 but also another bump near 0. The Western and Mountain regions show a similar pattern, with strong peaks that then dip down fairly drastically. The South and the Midwest lean more toward higher Trump support, and we don't see the "dipping" occur as much. This leads us to believe that during that election those regions had more states that favored Trump and the Republican party.
ggplot(Finaldata, aes(x = percent_asian, y = percent_republican2016, color = Winner)) +
scale_color_manual(values = c("blue","purple","red")) +
geom_point(alpha = 0.8) + geom_text(aes(label=ifelse(percent_asian>.2,as.character(state),"")),hjust=1.2,vjust=0)+ geom_text(aes(label=ifelse(percent_republican2016<.2,as.character(state),"")),hjust=-.1,vjust=0) + ggtitle("Percent of Asians and Trump Support by Election Outcome") + xlab("Percent Asian") + ylab("Trump Support")
This final demographic visualization depicts Trump support across all states against the percent of residents who identify as Asian, classified by whether or not Trump won in that specific state. The two outliers are labeled: Hawaii and DC. These are interesting because DC had a small percent of Asian residents and also low Trump support, which is expected. Hawaii, however, had a much larger percentage of Asian-identifying residents but still had a decent amount of Trump support. All in all, this visualization tells us that states with fewer Asian-identifying residents tended to have higher Trump support.
## Creating a df of just the cases the week before 3/17
weekbefore <- data.frame(Finaldata) %>% filter(Day <= as.Date("2020-03-17"))
plot1<- plot_usmap(data = weekbefore, values = "positive", color = "white") +
scale_fill_gradient2(low = "white", high ="blue", mid = "skyblue",midpoint = 30,
name = "Cases (#)",label = scales::comma,
limits = c(0,500)) +
theme(legend.position = "right")+ ggtitle("# of COVID Cases Before 3/17") +
theme(
plot.title = element_text(color="Black", size=8)
)
## Creating a df of just the cases the week after 3/17
weekafter <- data.frame(Finaldata) %>% filter(Day >= as.Date("2020-03-17"))
plot2<-plot_usmap(data = weekafter, values = "positive", color = "white") +
scale_fill_gradient2(low = "white", high ="red", mid = "blue",midpoint = 500,
name = "Cases (#)",label = scales::comma,
limits = c(0,10000)) +
theme(legend.position = "right")+ ggtitle("# of COVID Cases After 3/17") +
theme(
plot.title = element_text(color="Black", size=5)
)
# The number of Positive Covid Cases, comparison before and after the China Virus announcement
grid.arrange(plot1, plot2, ncol=2)
This visualization allows us to compare the raw count of positive COVID-19 cases across the US before and after March 17th (when Trump said "China Virus"). The goal of this visualization is to show the number of positive COVID-19 cases as time progresses. From the first plot we can see that very few states have positive cases above 0-100, though states like Washington, Michigan, California, and New York have higher numbers. The second plot shows that while very few states are above ~2,000, there is overall a higher number of positive cases across the US. This intuitively makes sense: this is an infectious disease, and we would expect more positive cases over time.
b<- ggplot(Finaldata, aes(x = ChinaVirusInterest, fill = as.factor(State))) +
geom_density(alpha = 0.5)
ggplotly(b)
We can see that Google interest in the term "China Virus" has quite a large range between states. Very few states have high densities at the upper end of the interest scale, but there are some interesting density peaks among the lower values. For example, Alaska, Wyoming, and Iowa have unusual peaks around the 25-50 range. It is also interesting to note that there isn't an obvious mean or median value of China Virus interest among the states.
c<- ggplot(Finaldata, aes(x = Day, y = ChinaVirusInterest, color = factor(State))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ State) +
theme(legend.position = "none")
ggplotly(c)
With this visualization we wanted to observe each state's slope of interest in the term "China Virus" over time. Arguably, the slopes of interest among the states are not significantly different. There are some interesting downward slopes in states like Wyoming, but we would need to dig a bit deeper to fully understand what that means. However, a really interesting facet of this visualization is that we can see a pattern of peaks and valleys among all states. In other words, from observation, the majority of states demonstrate a pattern correlated with the day. This would be interesting to explore further, along with how it interacts with other aspects like the number of COVID cases by state.
\[\begin{aligned} i = \text{region, } j = \text{date, } Y = \text{China Virus Interest}\\ Y_{ij}|b_{0i},b_{1i} \sim N(b_{0i} + b_{1i}Day_{ij},\sigma^2)\\ b_{0i} \sim N(...)\\ b_{1i} \sim N(...)\\ \end{aligned}\]
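The notation above includes a region-specific slope on Day. In lme4/rstanarm formula syntax, that full random-intercept-and-slope model would be written as sketched below; note the chunk that follows fits only the random-intercept version, `ChinaVirusInterest ~ (1 | Region)`:

```r
# Formula for the full model written above: (Day | Region) gives each region
# its own intercept b_0i and slope b_1i. Fitting it would mean passing this
# formula to rstanarm::stan_glmer() along with Finaldata.
full_formula <- ChinaVirusInterest ~ Day + (Day | Region)

all.vars(full_formula)  # "ChinaVirusInterest" "Day" "Region"
```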
set.seed(454)
Finaldata<-
Finaldata%>%
mutate(Day=as.numeric(Day))
model_1 <- stan_glmer(
ChinaVirusInterest ~ (1 | Region),
data = Finaldata, family = gaussian, chains = 4, iter = 5000*2
)
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 3.983 seconds (Warm-up)
## Chain 1: 4.071 seconds (Sampling)
## Chain 1: 8.054 seconds (Total)
## Chain 1:
## Warning: There were 13 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems
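Per the warning above, divergent transitions can often be resolved by raising `adapt_delta` (a real `stan_glmer` control argument, default 0.95) toward 1 so the sampler takes smaller, more careful steps. A sketch of the refit call, not run here because of the long sampling time:

```r
# model_1 <- stan_glmer(
#   ChinaVirusInterest ~ (1 | Region),
#   data = Finaldata, family = gaussian, chains = 4, iter = 5000 * 2,
#   adapt_delta = 0.99  # above the 0.95 default, per the warning
# )
```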
mcmc_trace(model_1)
mcmc_dens_overlay(model_1)
From the trace plot we can see that the intercepts for each region are not very different from each other. It is possible that the Midwest and the Northeast have a higher intercept compared to the West, but the mean is centered around 0 for almost all of them. We would rather use states to see whether the intercepts differ by state, but unfortunately our code did not run.
Our chains are stable, which may tell us that the model would work with fewer iterations.
# Store the chains
model_1_df <- as.array(model_1) %>%
melt() %>%
pivot_wider(names_from = parameters, values_from = value)
# Wrangle the chains
# Note the need for backticks around the Sigma parameter name, and that `sigma`
# is a standard deviation, so it must be squared to put it on the variance scale
model_1_df <- model_1_df %>%
mutate(sigma_sq_w = `sigma`^2, sigma_sq_b = `Sigma[Region:(Intercept),(Intercept)]`) %>%
mutate(correlation = (sigma_sq_b/(sigma_sq_w+sigma_sq_b)))
ggplot(model_1_df, aes(x = correlation)) +
geom_density(alpha = 0.5)
In our correlation density plot we can see that values within the same region are not very correlated, meaning that region is probably a poor way to segment the data.
head(data.frame(summary(model_1)),-2)
## mean mcse sd
## (Intercept) 36.95340292 0.019676688 1.4371318
## b[(Intercept) Region:Midwest] 0.77427231 0.021061470 1.6126594
## b[(Intercept) Region:Mountain] 0.05328873 0.019625914 1.6513180
## b[(Intercept) Region:Northeast] 1.34912934 0.022448305 1.7568004
## b[(Intercept) Region:South] -0.22563184 0.018683640 1.5661028
## b[(Intercept) Region:West] -1.79649892 0.025021118 2.0899731
## sigma 18.91985542 0.004000357 0.4841495
## Sigma[Region:(Intercept),(Intercept)] 8.93404308 0.347485793 21.4982015
## X10. X50. X90.
## (Intercept) 35.3063220 37.0303694035 38.5142593
## b[(Intercept) Region:Midwest] -0.8645058 0.5120844666 2.7741190
## b[(Intercept) Region:Mountain] -1.8114164 0.0005233039 1.9885548
## b[(Intercept) Region:Northeast] -0.3537904 1.0316081519 3.6119103
## b[(Intercept) Region:South] -2.0501628 -0.1533411824 1.5110099
## b[(Intercept) Region:West] -4.6517569 -1.3640026881 0.2458575
## sigma 18.3031300 18.9050957874 19.5429719
## Sigma[Region:(Intercept),(Intercept)] 0.1820068 3.4244641438 20.3000808
## n_eff Rhat
## (Intercept) 5334 1.0000963
## b[(Intercept) Region:Midwest] 5863 1.0001977
## b[(Intercept) Region:Mountain] 7079 0.9999822
## b[(Intercept) Region:Northeast] 6125 1.0002203
## b[(Intercept) Region:South] 7026 0.9999207
## b[(Intercept) Region:West] 6977 1.0000634
## sigma 14647 0.9999579
## Sigma[Region:(Intercept),(Intercept)] 3828 1.0005962
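As a rough check on that conclusion, plugging the posterior means from the summary above into the correlation (intraclass correlation) formula gives a value near zero; again, `sigma` is a standard deviation and must be squared:

```r
# Posterior means from the model summary above
sigma_w    <- 18.91985542  # within-region standard deviation (sigma)
sigma_sq_b <- 8.93404308   # between-region variance, Sigma[Region:(Intercept),(Intercept)]

# Intraclass correlation: share of total variance attributable to region
icc <- sigma_sq_b / (sigma_sq_b + sigma_w^2)
round(icc, 3)  # 0.024: region explains very little of the variation
```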
The obvious improvement we need to make is exploring more models. We can see from our visualizations that there are a wide variety of relationships we can dig into with our dataset. However, as a group, we need to decide which path(s) to take. Once we decide which relationship(s) to analyze further, this project can be enhanced with animated visualizations or even a shiny app.